hamming_loss (bitwise error rate for multilabel classification)#
hamming_loss measures the fraction of labels that are wrong.
For standard (single-label) classification it reduces to the misclassification rate.
For multilabel classification it averages mistakes across the
(sample, label) grid — how many bits did we flip?
Learning goals#
write the multiclass and multilabel formulas (with clear notation)
build intuition with plots (what counts as an error)
implement Hamming loss from scratch in NumPy (including sample_weight)
see how Hamming loss interacts with probability thresholds in multilabel logistic regression
know pros/cons and when to prefer other metrics
Quick import#
from sklearn.metrics import hamming_loss
Table of contents#
Definitions and notation
Intuition (plots)
NumPy implementation + sanity checks
Using Hamming loss for threshold tuning (multilabel logistic regression)
Pros, cons, pitfalls
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.metrics import hamming_loss as sk_hamming_loss
from sklearn.model_selection import train_test_split
pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.random.seed(0)
np.set_printoptions(precision=3, suppress=True)
1) Definitions and notation#
Assume we have \(n\) samples.
Single-label classification (binary or multiclass)#
True label: \(y_i \in \{0,1,\dots,K-1\}\)
Predicted label: \(\hat{y}_i \in \{0,1,\dots,K-1\}\)
The Hamming loss is the fraction of samples whose predicted label is wrong:
\[
\mathrm{HL}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[\hat{y}_i \neq y_i]
\]
For single-label classification this is exactly the misclassification rate (a.k.a. zero_one_loss).
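This equivalence is easy to check directly; a minimal sketch comparing scikit-learn's two metrics on the same 1D integer labels:

```python
import numpy as np
from sklearn.metrics import hamming_loss, zero_one_loss

# Two mismatches out of five single-label predictions.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 1, 1])

# For 1D inputs both metrics are the misclassification rate.
print(hamming_loss(y_true, y_pred))   # 2 wrong out of 5 -> 0.4
print(zero_one_loss(y_true, y_pred))  # 0.4
```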
Multilabel classification (label indicator matrix)#
True labels: \(Y \in \{0,1\}^{n\times L}\) (each row can have multiple 1s)
Predictions: \(\hat{Y} \in \{0,1\}^{n\times L}\)
Hamming loss counts mismatches over all \(nL\) (sample, label) decisions:
\[
\mathrm{HL}(Y, \hat{Y}) = \frac{1}{nL} \sum_{i=1}^{n} \sum_{j=1}^{L} \mathbb{1}[\hat{Y}_{ij} \neq Y_{ij}]
\]
Equivalently, it is the average Hamming distance per sample, normalized by \(L\).
Relationship to micro-accuracy#
If you treat each (sample, label) cell as a binary decision, then Hamming loss is one minus the micro-averaged accuracy over all \(nL\) bits:
\[
\mathrm{HL}(Y, \hat{Y}) = 1 - \frac{\#\{(i,j) : Y_{ij} = \hat{Y}_{ij}\}}{nL}
\]
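The identity can be checked numerically; a small sketch (random indicator matrices, chosen only for illustration) that flattens both matrices into one long bit vector:

```python
import numpy as np
from sklearn.metrics import hamming_loss, accuracy_score

rng = np.random.default_rng(42)
Y_true = rng.integers(0, 2, size=(50, 8))
Y_pred = rng.integers(0, 2, size=(50, 8))

# Treat every (sample, label) cell as one binary decision:
# Hamming loss equals 1 minus the accuracy over all n*L bits.
micro_acc = accuracy_score(Y_true.ravel(), Y_pred.ravel())
print(np.isclose(hamming_loss(Y_true, Y_pred), 1.0 - micro_acc))  # True
```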
Contrast with subset accuracy (exact match)#
Subset accuracy (a.k.a. exact match ratio) for multilabel requires getting all labels correct for a sample:
\[
\mathrm{SubsetAcc}(Y, \hat{Y}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[\hat{Y}_i = Y_i]
\]
where \(\hat{Y}_i = Y_i\) means the entire rows match.
Hamming loss is more forgiving: getting 1 label wrong out of 20 is a small penalty, while subset accuracy would count the whole sample as wrong.
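A tiny example (one sample with 20 labels and a single missed positive, chosen for illustration) makes the contrast concrete:

```python
import numpy as np
from sklearn.metrics import hamming_loss, accuracy_score

# One sample, 20 labels, exactly one positive label...
y_true = np.zeros((1, 20), dtype=int)
y_true[0, 0] = 1
y_pred = np.zeros((1, 20), dtype=int)  # ...which the prediction misses

print(hamming_loss(y_true, y_pred))    # 1/20 = 0.05 (small penalty)
print(accuracy_score(y_true, y_pred))  # 0.0 (the whole sample counts as wrong)
```

Note that for multilabel indicator input, `accuracy_score` is exactly subset accuracy.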
2) Intuition (plots)#
Think of each label decision as a bit.
A value of 0 means perfect predictions; 0.25 means 25% of all bits are wrong.
Below we visualize Y_true, Y_pred, and the mismatch matrix (Y_true != Y_pred).
Y_true = np.array(
[
[1, 0, 0, 1, 0, 1],
[0, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 1, 0],
[0, 0, 0, 1, 0, 0],
[1, 0, 1, 1, 0, 0],
[0, 1, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0],
],
dtype=int,
)
Y_pred = np.array(
[
[1, 0, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 1, 0],
[0, 0, 0, 1, 1, 0],
[1, 0, 0, 1, 0, 0],
[0, 1, 0, 1, 0, 1],
[0, 0, 0, 0, 0, 0],
[1, 0, 1, 0, 0, 0],
],
dtype=int,
)
mismatch = (Y_true != Y_pred).astype(int)
hl_manual = float(mismatch.mean())
hl_sklearn = float(sk_hamming_loss(Y_true, Y_pred))
subset_acc = float(np.mean(np.all(Y_true == Y_pred, axis=1)))
print(f'Hamming loss (manual) : {hl_manual:.3f}')
print(f'Hamming loss (sklearn): {hl_sklearn:.3f}')
print(f'Subset accuracy : {subset_acc:.3f}')
Hamming loss (manual) : 0.188
Hamming loss (sklearn): 0.188
Subset accuracy : 0.125
n_samples, n_labels = Y_true.shape
x_labels = [f'label_{j}' for j in range(n_labels)]
y_labels = [f'sample_{i}' for i in range(n_samples)]
fig = make_subplots(
rows=1,
cols=3,
subplot_titles=['Y_true', 'Y_pred', 'Mismatch (1 = wrong)'],
)
fig.add_trace(
go.Heatmap(
z=Y_true,
x=x_labels,
y=y_labels,
colorscale='Blues',
zmin=0,
zmax=1,
showscale=False,
),
row=1,
col=1,
)
fig.add_trace(
go.Heatmap(
z=Y_pred,
x=x_labels,
y=y_labels,
colorscale='Greens',
zmin=0,
zmax=1,
showscale=False,
),
row=1,
col=2,
)
fig.add_trace(
go.Heatmap(
z=mismatch,
x=x_labels,
y=y_labels,
colorscale=[[0, '#ffffff'], [1, '#d62728']],
zmin=0,
zmax=1,
showscale=False,
),
row=1,
col=3,
)
fig.update_layout(
title=f'Hamming loss = {hl_manual:.3f} (fraction of wrong bits)',
height=420,
)
fig.show()
per_sample = mismatch.mean(axis=1)
per_label = mismatch.mean(axis=0)
fig1 = px.bar(
x=[f'sample_{i}' for i in range(n_samples)],
y=per_sample,
title='Per-sample contribution: fraction of wrong labels',
labels={'x': 'sample', 'y': 'wrong-label fraction'},
)
fig1.add_hline(y=hl_manual, line_dash='dash', annotation_text='global HL')
fig1.update_yaxes(range=[0, 1])
fig1.show()
fig2 = px.bar(
x=x_labels,
y=per_label,
title='Per-label error rate',
labels={'x': 'label', 'y': 'error rate'},
)
fig2.add_hline(y=hl_manual, line_dash='dash', annotation_text='global HL')
fig2.update_yaxes(range=[0, 1])
fig2.show()
A common pitfall: multiclass as one-hot vs integer labels#
For multiclass problems you often have one true class per sample.
If you pass integer labels (shape = (n,)), Hamming loss is the misclassification rate.
If you convert to one-hot (shape = (n, K)), a single wrong prediction creates two bit errors (one FN + one FP), so the value is the misclassification rate scaled by \(2/K\).
Below we compare the two representations.
y_true_mc = np.array([0, 1, 2, 2, 1, 0])
y_pred_mc = np.array([0, 2, 2, 1, 1, 0])
hl_int = float(sk_hamming_loss(y_true_mc, y_pred_mc))
K = 3
Y_true_oh = np.eye(K, dtype=int)[y_true_mc]
Y_pred_oh = np.eye(K, dtype=int)[y_pred_mc]
hl_onehot = float(sk_hamming_loss(Y_true_oh, Y_pred_oh))
print(f'Hamming loss with integer labels: {hl_int:.3f}')
print(f'Hamming loss with one-hot labels: {hl_onehot:.3f} (note the scaling)')
Hamming loss with integer labels: 0.333
Hamming loss with one-hot labels: 0.222 (note the scaling)
3) NumPy implementation + sanity checks#
A from-scratch implementation is straightforward once you remember the definition: count mismatches and average.
sample_weight in sklearn.metrics.hamming_loss applies at the sample level:
compute per-sample Hamming loss (mean mismatches across labels)
take a weighted average across samples
def hamming_loss_np(y_true, y_pred, *, sample_weight=None) -> float:
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
if y_true.shape != y_pred.shape:
raise ValueError(f'shape mismatch: y_true {y_true.shape} vs y_pred {y_pred.shape}')
if y_true.ndim == 1:
mismatches = (y_true != y_pred).astype(float)
if sample_weight is None:
return float(mismatches.mean())
w = np.asarray(sample_weight, dtype=float)
if w.shape != (y_true.shape[0],):
raise ValueError(f'sample_weight must have shape {(y_true.shape[0],)}, got {w.shape}')
return float(np.average(mismatches, weights=w))
if y_true.ndim == 2:
mismatches = (y_true != y_pred).astype(float)
per_sample = mismatches.mean(axis=1)
if sample_weight is None:
return float(per_sample.mean())
w = np.asarray(sample_weight, dtype=float)
if w.shape != (y_true.shape[0],):
raise ValueError(f'sample_weight must have shape {(y_true.shape[0],)}, got {w.shape}')
return float(np.average(per_sample, weights=w))
raise ValueError('y_true and y_pred must be 1D (single-label) or 2D (multilabel)')
# Sanity checks vs scikit-learn
rng = np.random.default_rng(0)
# 1) single-label (multiclass)
y_true_1d = rng.integers(0, 4, size=200)
y_pred_1d = rng.integers(0, 4, size=200)
print(
'1D close?',
np.allclose(
hamming_loss_np(y_true_1d, y_pred_1d),
sk_hamming_loss(y_true_1d, y_pred_1d),
),
)
# 2) multilabel indicator
Y_true_2d = rng.integers(0, 2, size=(120, 7))
Y_pred_2d = rng.integers(0, 2, size=(120, 7))
print(
'2D close?',
np.allclose(
hamming_loss_np(Y_true_2d, Y_pred_2d),
sk_hamming_loss(Y_true_2d, Y_pred_2d),
),
)
# 3) sample weights
w = rng.random(size=Y_true_2d.shape[0])
hl_np_w = hamming_loss_np(Y_true_2d, Y_pred_2d, sample_weight=w)
hl_sk_w = float(sk_hamming_loss(Y_true_2d, Y_pred_2d, sample_weight=w))
print('weighted close?', np.allclose(hl_np_w, hl_sk_w))
print('weighted value:', hl_np_w)
1D close? True
2D close? True
weighted close? True
weighted value: 0.528672497390485
4) Using Hamming loss for threshold tuning (multilabel logistic regression)#
Hamming loss is defined on hard predictions (0/1), so it is not differentiable.
A common pattern is:
Train a probabilistic model (e.g. multilabel logistic regression) by minimizing a differentiable surrogate (binary cross-entropy / log_loss).
Convert probabilities to hard labels with a threshold \(t\).
Choose \(t\) (or per-label thresholds) to minimize Hamming loss on a validation set.
Model#
For \(L\) labels, we use independent sigmoids:
\[
P(Y_{ij} = 1 \mid x_i) = \sigma(x_i^\top w_j + b_j), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
\]
Prediction with a threshold \(t\):
\[
\hat{Y}_{ij} = \mathbb{1}\left[\sigma(x_i^\top w_j + b_j) \ge t\right]
\]
We will train with average binary cross-entropy (from logits), where \(z_{ij} = x_i^\top w_j + b_j\):
\[
\mathrm{BCE}(Y, Z) = \frac{1}{nL} \sum_{i=1}^{n} \sum_{j=1}^{L} \bigl(\operatorname{softplus}(z_{ij}) - Y_{ij}\, z_{ij}\bigr)
\]
Then we will tune \(t\) to minimize Hamming loss.
def sigmoid(z):
z = np.asarray(z, dtype=float)
return 1.0 / (1.0 + np.exp(-z))
def softplus(z):
# Stable softplus: log(1 + exp(z))
z = np.asarray(z, dtype=float)
return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)
def bce_from_logits(Y, Z) -> float:
Y = np.asarray(Y, dtype=float)
Z = np.asarray(Z, dtype=float)
return float(np.mean(softplus(Z) - Y * Z))
def standardize_fit_transform(X):
X = np.asarray(X, dtype=float)
mean = X.mean(axis=0)
std = X.std(axis=0)
std = np.where(std == 0, 1.0, std)
return (X - mean) / std, mean, std
def standardize_transform(X, mean, std):
X = np.asarray(X, dtype=float)
std = np.where(std == 0, 1.0, std)
return (X - mean) / std
def fit_multilabel_logreg_gd(
X_train,
Y_train,
X_val=None,
Y_val=None,
*,
lr=0.8,
n_steps=400,
l2=0.0,
threshold=0.5,
):
X_train = np.asarray(X_train, dtype=float)
Y_train = np.asarray(Y_train, dtype=float)
n_samples, n_features = X_train.shape
n_labels = Y_train.shape[1]
W = np.zeros((n_features, n_labels))
b = np.zeros(n_labels)
history = {
'step': [],
'train_bce': [],
'train_hl': [],
'val_bce': [],
'val_hl': [],
}
for step in range(n_steps):
Z = X_train @ W + b
P = sigmoid(Z)
train_bce = bce_from_logits(Y_train, Z)
# dJ/dZ = (P - Y) / (n_samples * n_labels) when J is the mean over all entries
G = (P - Y_train) / (n_samples * n_labels)
grad_W = X_train.T @ G + l2 * W
grad_b = G.sum(axis=0)
W -= lr * grad_W
b -= lr * grad_b
Y_hat = (P >= threshold).astype(int)
train_hl = hamming_loss_np(Y_train.astype(int), Y_hat)
history['step'].append(step)
history['train_bce'].append(train_bce)
history['train_hl'].append(train_hl)
if X_val is not None and Y_val is not None:
Z_val = X_val @ W + b
P_val = sigmoid(Z_val)
val_bce = bce_from_logits(Y_val, Z_val)
val_hl = hamming_loss_np(Y_val.astype(int), (P_val >= threshold).astype(int))
history['val_bce'].append(val_bce)
history['val_hl'].append(val_hl)
else:
history['val_bce'].append(None)
history['val_hl'].append(None)
return W, b, history
# Synthetic multilabel dataset
rng = np.random.default_rng(1)
n_samples = 1600
n_features = 8
n_labels = 6
X = rng.normal(size=(n_samples, n_features))
W_true = rng.normal(scale=1.2, size=(n_features, n_labels))
# Make some labels rarer than others by shifting biases
b_true = np.linspace(-2.0, 0.5, n_labels)
Z_true = X @ W_true + b_true
P_true = sigmoid(Z_true)
Y = (rng.random(size=P_true.shape) < P_true).astype(int)
X_train, X_val, Y_train, Y_val = train_test_split(
X,
Y,
test_size=0.3,
random_state=0,
)
X_train_s, mean, std = standardize_fit_transform(X_train)
X_val_s = standardize_transform(X_val, mean, std)
W, b, hist = fit_multilabel_logreg_gd(
X_train_s,
Y_train,
X_val=X_val_s,
Y_val=Y_val,
lr=0.9,
n_steps=300,
l2=0.0,
threshold=0.5,
)
fig = make_subplots(specs=[[{'secondary_y': True}]])
fig.add_trace(go.Scatter(x=hist['step'], y=hist['train_bce'], name='train BCE'), secondary_y=False)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['val_bce'], name='val BCE'), secondary_y=False)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['train_hl'], name='train Hamming loss'), secondary_y=True)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['val_hl'], name='val Hamming loss'), secondary_y=True)
fig.update_xaxes(title_text='gradient descent step')
fig.update_yaxes(title_text='binary cross-entropy (lower is better)', secondary_y=False)
fig.update_yaxes(title_text='Hamming loss (lower is better)', secondary_y=True, range=[0, 1])
fig.update_layout(title='Train with BCE, monitor Hamming loss at threshold=0.5', height=480)
fig.show()
# Tune the probability threshold to minimize validation Hamming loss
Z_val = X_val_s @ W + b
P_val = sigmoid(Z_val)
thresholds = np.linspace(0.05, 0.95, 91)
hl_vals = []
for t in thresholds:
Y_hat_val = (P_val >= t).astype(int)
hl_vals.append(hamming_loss_np(Y_val, Y_hat_val))
hl_vals = np.array(hl_vals)
best_idx = int(np.argmin(hl_vals))
best_t = float(thresholds[best_idx])
t05_idx = int(np.where(np.isclose(thresholds, 0.5))[0][0])
hl_at_05 = float(hl_vals[t05_idx])
hl_best = float(hl_vals[best_idx])
print(f'Validation HL at t=0.50: {hl_at_05:.4f}')
print(f'Best threshold t*: {best_t:.2f}')
print(f'Validation HL at t*: {hl_best:.4f}')
Validation HL at t=0.50: 0.1437
Best threshold t*: 0.51
Validation HL at t*: 0.1420
fig = px.line(
x=thresholds,
y=hl_vals,
title='Validation Hamming loss vs threshold',
labels={'x': 'threshold t', 'y': 'Hamming loss'},
)
fig.add_vline(x=0.5, line_dash='dash', line_color='gray', annotation_text='t=0.5')
fig.add_vline(x=best_t, line_dash='dash', line_color='green', annotation_text='best t*')
fig.update_yaxes(range=[0, 1])
fig.show()
# Optional: per-label threshold tuning (can reduce HL when base rates differ)
per_label_thresholds = np.zeros(n_labels)
for j in range(n_labels):
errs = []
for t in thresholds:
pred_j = (P_val[:, j] >= t).astype(int)
errs.append(float(np.mean(pred_j != Y_val[:, j])))
per_label_thresholds[j] = thresholds[int(np.argmin(errs))]
Y_hat_per_label = (P_val >= per_label_thresholds).astype(int)
hl_per_label = hamming_loss_np(Y_val, Y_hat_per_label)
print('Per-label thresholds:', np.round(per_label_thresholds, 2))
print('Validation HL (single t*) :', hl_best)
print('Validation HL (per-label) :', hl_per_label)
fig = px.bar(
x=[f'label_{j}' for j in range(n_labels)],
y=per_label_thresholds,
title='Per-label thresholds that minimize per-label error',
labels={'x': 'label', 'y': 'best threshold'},
)
fig.update_yaxes(range=[0, 1])
fig.show()
Per-label thresholds: [0.48 0.68 0.46 0.51 0.46 0.5 ]
Validation HL (single t*) : 0.14201388888888886
Validation HL (per-label) : 0.13819444444444445
5) Pros, cons, pitfalls#
Pros#
Simple and interpretable: “fraction of wrong labels.”
Works naturally for multilabel: does not require perfect set matches.
Label-wise averaging: each label decision contributes equally (micro view over all bits).
Comparable across models when the label space is fixed (same \(L\)).
Cons / caveats#
Can look deceptively good on sparse multilabel problems: if most labels are 0, predicting all zeros yields many true negatives and a low Hamming loss.
Does not capture set quality: predicting a wrong combination can still have a small Hamming loss if only a few bits differ.
Not differentiable: not suitable as a direct gradient-based training objective; use a surrogate loss and treat Hamming loss as an evaluation metric.
Representation matters for multiclass: integer labels vs one-hot produce different scales.
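To make the sparsity caveat concrete, here is a small sketch (synthetic data with an assumed ~5% positive rate per label) where the degenerate all-zeros predictor scores a low Hamming loss while recalling nothing:

```python
import numpy as np
from sklearn.metrics import hamming_loss, recall_score

rng = np.random.default_rng(0)
# Sparse multilabel data: each label is positive only ~5% of the time.
Y_true = (rng.random(size=(500, 10)) < 0.05).astype(int)
Y_pred = np.zeros_like(Y_true)  # degenerate model: predict no labels at all

print(f'Hamming loss : {hamming_loss(Y_true, Y_pred):.3f}')  # ~0.05, looks "good"
print(f'Micro recall : {recall_score(Y_true, Y_pred, average="micro"):.3f}')  # 0.000
```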
Common pitfalls#
Passing probabilities instead of hard labels (threshold them first).
Using one-hot for multiclass and interpreting the value as misclassification rate.
Relying on Hamming loss alone with heavy class imbalance; complement with per-label precision/recall/F1, Jaccard score, or subset accuracy.
Where it’s a good fit#
Multilabel tagging where each label decision matters roughly equally (e.g. topic tags, attribute prediction).
Problems where you want a single number that reflects “average per-label error rate,” not strict exact matches.
Exercises#
Show algebraically that for multilabel indicators, Hamming loss = 1 - micro-accuracy.
Construct a sparse multilabel dataset where predicting all zeros achieves a low Hamming loss but terrible recall.
Implement a per-label F1 score and compare its behavior to Hamming loss under imbalance.
References#
scikit-learn docs: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html
Hamming distance (background): https://en.wikipedia.org/wiki/Hamming_distance